Step 0: Load and preprocess the data¶

In [ ]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
               'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
               'hours-per-week', 'native-country', 'income']

train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

train_df = pd.read_csv(train_url, names=names, na_values='?')
train_df = train_df.dropna()

test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

test_df = pd.read_csv(test_url, names=names, na_values='?', skiprows=1)
test_df = test_df.dropna()

feature_names_before_process = list(train_df.columns)

Preprocess the dataset¶

It is necessary to first understand the dataset and preprocess the data. The code above shows that this dataset has 14 features: 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', and 'native-country', plus one target, 'income'.

In order to make the training process more efficient and reduce noise, we can merge features with similar meanings, e.g. 'education' and 'education-num', or 'marital-status' and 'relationship', and group the rarest values of the categorical (non-numeric) features into an 'Other' category.

In [ ]:
train_df = train_df.replace(' ?', ' Other')
test_df = test_df.replace(' ?', ' Other')
In [ ]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))

def plot_data_distribution(feature, ax):
    counts = train_df[feature].value_counts()

    # Calculate the proportions
    proportions = counts / counts.sum()

    # Plot the pie chart
    ax.pie(counts, startangle=90, autopct='%.1f%%')
    ax.legend(proportions.index, loc=2, fontsize=8)
    ax.set_title(feature, size=20)

features_non_figure = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

for i, feature in enumerate(features_non_figure):
    row = i // 4
    col = i % 4
    plot_data_distribution(feature, axs[row, col])

plt.show()

We can learn from the plots above that some features have a small representation in the dataset, accounting for less than 5% of the total values. To simplify the data and avoid overfitting on these rare values, we group them together and label them as "Other".

In [ ]:
# Define a function to replace rare values with "Other"
def replace_rare_values(feature, threshold):
    counts = train_df[feature].value_counts()
    rare_values = counts[counts / counts.sum() < threshold].index
    train_df.loc[train_df[feature].isin(rare_values), feature] = " Other"
    test_df.loc[test_df[feature].isin(rare_values), feature] = " Other"

# Call the function for each feature with a threshold of 5%
for feature in ['workclass', 'occupation', 'race', 'native-country', 'marital-status']:
    replace_rare_values(feature, 0.05)

Load the data¶

In [ ]:
#Target variable 
y_train = [1 if y == ' >50K' else 0 for y in train_df['income']]
#Model features
X_train = train_df[['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status', 'occupation', 'race', 'sex', 'capital-gain', 'capital-loss', 
                    'hours-per-week','native-country']].copy()

#Target variable 
y_test = [1 if y == ' >50K.' else 0 for y in test_df['income']]
#Model features
X_test = test_df[['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status', 'occupation', 'race', 'sex', 'capital-gain', 'capital-loss', 
                    'hours-per-week','native-country']].copy()

Encode the data¶

There are two main ways to encode categorical data for machine learning: one-hot encoding and label encoding. Although a preliminary accuracy test I ran found little difference between the two methods, label encoding can be problematic for features like occupation and native country, as it imposes an ordinal ranking that has no real meaning. This coursework therefore uses one-hot encoding to minimize bias introduced during preprocessing.

In [ ]:
# Get dummy variables for categorical features in training set
X_train = pd.get_dummies(X_train)

# Get dummy variables for categorical features in test set
X_test = pd.get_dummies(X_test)
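One caveat worth noting: calling `pd.get_dummies` separately on the train and test sets can produce mismatched columns if a category appears in only one split (the rare-value grouping above makes this unlikely here, but it is not guaranteed). A minimal sketch of guarding against it with `reindex`, using toy data rather than the Adult set:

```python
import pandas as pd

# Toy frames standing in for X_train/X_test (hypothetical data, not the Adult set)
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['blue', 'green']})  # 'green' never appears in train

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Reindex test columns to the training columns; categories missing from
# the test split become all-zero columns, unseen ones are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))  # same columns as train_enc
```

`DataFrame.align(..., join='left', axis=1, fill_value=0)` achieves the same effect in a single call.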

Step 1: Build and analyse the performance of multiple ML models¶

1.1 Large decision tree¶

Find the optimized hyperparameters in a relatively large range¶

In [ ]:
# find the optimized hyperparameters of the large decision tree
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# the search range for the large decision tree's hyperparameters
param_grid_large = {'max_depth': range(6, 21, 2),
                    'min_samples_split': range(2, 7, 2),
                    'min_samples_leaf': range(2,11,2)
                    }

clf_large_dt = DecisionTreeClassifier(random_state=42)
grid_search_large_dt = GridSearchCV(clf_large_dt, param_grid_large, cv=5, scoring='accuracy')
grid_search_large_dt.fit(X_train, y_train)

# print the optimized hyperparameters
print('Best hyperparameters: {}'.format(grid_search_large_dt.best_params_))
print('Best accuracy: {:.3f}'.format(grid_search_large_dt.best_score_))
Best hyperparameters: {'max_depth': 10, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best accuracy: 0.857

Plot the tree¶

In [ ]:
import os
os.environ["PATH"] += os.pathsep + 'F:/Graphviz/bin'

from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score
import graphviz

# Build the large decision tree model with the best hyperparameters
clf_large_dt = DecisionTreeClassifier(max_depth=grid_search_large_dt.best_params_['max_depth'],
                             min_samples_split=grid_search_large_dt.best_params_['min_samples_split'],
                             min_samples_leaf = grid_search_large_dt.best_params_['min_samples_leaf'],
                             random_state=42)
clf_large_dt.fit(X_train, y_train)

y_pred_large_dt = clf_large_dt.predict(X_test)
acc = accuracy_score(y_test, y_pred_large_dt)
print('Accuracy:', acc)

# plot decision tree with Graphviz
dot_data = export_graphviz(clf_large_dt, out_file='./decision_tree_large.dot', 
                           feature_names=X_train.columns.tolist(),  
                           class_names=['less_than_fifty_thousands', 'greater_than_fifty_thousands'],  
                           filled=True, rounded=True,  
                           special_characters=True)

# save the png
! dot -Tpng ./decision_tree_large.dot -o ./decision_tree_large.png

# Display png
from IPython.display import Image
Image(filename='decision_tree_large.png')
Accuracy: 0.8628462625145875
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.610607 to fit
Out[ ]:

1.2 Small decision tree¶

Find the optimized hyperparameters in a relatively small range¶

In [ ]:
# find the optimized hyperparameters of the small decision tree
# the search range for the small decision tree's hyperparameters
param_grid_small = {'max_depth': range(1, 4),
                    'min_samples_split': range(2, 7, 2),
                    'min_samples_leaf': range(1, 6)
}

clf_small_dt = DecisionTreeClassifier(random_state=42)
grid_search_small_dt = GridSearchCV(clf_small_dt, param_grid_small, cv=5, scoring='accuracy')
grid_search_small_dt.fit(X_train, y_train)

# print the optimized hyperparameters
print('Best hyperparameters: {}'.format(grid_search_small_dt.best_params_))
print('Best accuracy: {:.3f}'.format(grid_search_small_dt.best_score_))
Best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best accuracy: 0.844

Plot the tree¶

In [ ]:
# Build the small decision tree model with the best hyperparameters
clf_small_dt = DecisionTreeClassifier(max_depth=grid_search_small_dt.best_params_['max_depth'],
                             min_samples_split=grid_search_small_dt.best_params_['min_samples_split'],
                             min_samples_leaf = grid_search_small_dt.best_params_['min_samples_leaf'],
                             random_state=42)
clf_small_dt.fit(X_train, y_train)

y_pred_small_dt = clf_small_dt.predict(X_test)
acc = accuracy_score(y_test, y_pred_small_dt)
print('Accuracy:', acc)

# export decision tree to Graphviz
dot_data = export_graphviz(clf_small_dt, out_file='./decision_tree_small.dot', 
                           feature_names=X_train.columns.tolist(),  
                           class_names=['less_than_fifty_thousands', 'greater_than_fifty_thousands'],  
                           filled=True, rounded=True,  
                           special_characters=True)

# save the png
! dot -Tpng ./decision_tree_small.dot -o ./decision_tree_small.png

# Display png
from IPython.display import Image
Image(filename='decision_tree_small.png')
Accuracy: 0.8447884036607088
Out[ ]:

1.3 Black-box: XGBoost¶

Find the optimized hyperparameters and fit the model¶

In [ ]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Define classifier
clf_xgb = xgb.XGBClassifier(
    objective='binary:logistic',
    random_state=42
)

# hyperparameter search range for optimization
params_xgb = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': range(3,11),
}

# find optimized hyperparameters with grid search
grid_search = GridSearchCV(
    estimator=clf_xgb,
    param_grid=params_xgb,
    cv=5,
    scoring='roc_auc'
)

grid_search.fit(X_train, y_train)

# print the best hyperparameters
print("Best parameters found: ", grid_search.best_params_)
print('Best ROC AUC: {:.3f}'.format(grid_search.best_score_))

# train the model
clf_xgb = xgb.XGBClassifier(**grid_search.best_params_, random_state=42)
clf_xgb.fit(X_train, y_train)

# predict the model
y_pred_xgb = clf_xgb.predict(X_test)
Best parameters found:  {'learning_rate': 0.1, 'max_depth': 6}
Best ROC AUC: 0.926

1.4 Compare the performance¶

AUC of three models¶

One way to compare performance across models is the area under the ROC curve (AUC); a larger area indicates a better model.

In [ ]:
from sklearn.metrics import roc_curve, auc

fig, ax = plt.subplots()

models = {
    'clf_small_dt': clf_small_dt,
    'clf_large_dt': clf_large_dt,
    'xgb_clf': clf_xgb
}
colors = {
    'clf_small_dt': 'r',
    'clf_large_dt': 'g',
    'xgb_clf': 'b'
}

# calculate AUC value
for model_name, model in models.items():
    y_score = model.predict_proba(X_test)[:,1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=colors[model_name], lw=2, alpha=0.8, label='{} (AUC = {:.2f})'.format(model_name, roc_auc))

# reference line
ax.plot([0, 1], [0, 1], linestyle='--', lw=1, color='gray', alpha=.8)

ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic')
ax.legend(loc="lower right")
plt.show()

Accuracy, F1 score, precision and recall of three models¶

Another way is to list all the performance scores of the models.

In [ ]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# compute performance
large_dt_acc = accuracy_score(y_test, y_pred_large_dt)
large_dt_f1 = f1_score(y_test, y_pred_large_dt)
large_dt_prec = precision_score(y_test, y_pred_large_dt)
large_dt_rec = recall_score(y_test, y_pred_large_dt)

small_dt_acc = accuracy_score(y_test, y_pred_small_dt)
small_dt_f1 = f1_score(y_test, y_pred_small_dt)
small_dt_prec = precision_score(y_test, y_pred_small_dt)
small_dt_rec = recall_score(y_test, y_pred_small_dt)

xgb_acc = accuracy_score(y_test, y_pred_xgb)
xgb_f1 = f1_score(y_test, y_pred_xgb)
xgb_prec = precision_score(y_test, y_pred_xgb)
xgb_rec = recall_score(y_test, y_pred_xgb)

# print in a table
performance_comparison = {'Model': ['y_pred_large_dt', 'y_pred_small_dt', 'y_pred_xgb'],
        'Accuracy': [large_dt_acc, small_dt_acc, xgb_acc],
        'F1-Score': [large_dt_f1, small_dt_f1, xgb_f1],
        'Precision': [large_dt_prec, small_dt_prec, xgb_prec],
        'Recall': [large_dt_rec, small_dt_rec, xgb_rec]}

df_performance_comparison = pd.DataFrame(performance_comparison)
print(df_performance_comparison)
             Model  Accuracy  F1-Score  Precision    Recall
0  y_pred_large_dt  0.862846  0.668645   0.778776  0.585803
1  y_pred_small_dt  0.844788  0.606815   0.755521  0.507020
2       y_pred_xgb  0.874393  0.704864   0.792086  0.634945

From the table, it can be observed that the XGBoost model (y_pred_xgb) performs the best among the three models for all the metrics, followed by the large decision tree model (y_pred_large_dt), and the small decision tree model (y_pred_small_dt) performs the worst. This suggests that the XGBoost model is the best choice for this particular task. However, evaluation and testing on different tasks and datasets are necessary to determine which model is the best choice.

Step 2: Interpret the models¶

2.1 Interpret the XGBoost¶

2.1.1 Global interpretation for XGBoost¶

SHAP is used for the global interpretation of models. In a SHAP summary plot, features closer to the top have a greater influence on the model, and each point in the plot represents one sample.

In [ ]:
import shap
shap.initjs()

# create SHAP values
explainer = shap.TreeExplainer(clf_xgb)
shap_values = explainer.shap_values(X_test)

# create summary plot
shap.summary_plot(shap_values, X_test, feature_names=X_train.columns)

It can be observed from the SHAP summary plot above that the top 5 most influential features are, respectively, 'marital-status_ Married-civ-spouse', 'age', 'education-num', 'capital-gain', and 'hours-per-week'.

In [ ]:
shap.force_plot(explainer.expected_value, shap_values[:1000,:], X_test.iloc[:1000,:])
Out[ ]:

With the force plot above, we can see the prediction for any individual point by hovering over it.

2.1.2 Local interpretation for XGBoost¶

SHAP is also used in this section for local interpretation of XGBoost.

In [ ]:
# plot the SHAP values for a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0,:], feature_names=X_train.columns)
Out[ ]:

It can be observed from the SHAP plot above that the first sample in the dataset (real income: "<=50K") is predicted to have income <=50K, with a SHAP output of -5.84. The only feature pushing the prediction up is sex_ Female. The features pushing it down, from strongest to weakest, are: not being 'marital-status_ Married-civ-spouse', age 25, education-num 7, occupation 'Machine-op-inspct', marital-status 'Never-married', hours-per-week 40, not being 'race_ White', and capital-gain 0.

2.2 Interpret the large decision tree¶

2.2.1 Global interpretation for large decision tree¶

In [ ]:
Image(filename='decision_tree_large.png')
Out[ ]:

Although a decision tree is interpretable in principle, a large one is difficult to read. Hence, we can use feature_importances_ from scikit-learn to check the exact weights of these features.

In [ ]:
import numpy as np 

importances = clf_large_dt.feature_importances_
indices = np.argsort(importances)[::-1][:10] 

print("Feature ranking:")
for f in range(10):
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))

plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Feature ranking:
1. marital-status_ Married-civ-spouse (0.394698)
2. education-num (0.214734)
3. capital-gain (0.193278)
4. capital-loss (0.062832)
5. age (0.048954)
6. hours-per-week (0.034893)
7. occupation_ Exec-managerial (0.011682)
8. fnlwgt (0.010176)
9. workclass_ Self-emp-not-inc (0.007098)
10. sex_ Female (0.006227)

Given that the large decision tree has a max depth of 10, the 10 most influential features, from high to low, are: 'marital-status_ Married-civ-spouse', 'education-num', 'capital-gain', 'capital-loss', 'age', 'hours-per-week', 'occupation_ Exec-managerial', 'fnlwgt', 'workclass_ Self-emp-not-inc', and 'sex_ Female'.

2.2.2 Local interpretation for large decision tree¶

LIME is used in this section for local interpretation.

In [ ]:
import lime
import lime.lime_tabular

# reset index
X_train = X_train.reset_index(drop=True)

# generate explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns,
                                                   class_names=[0,1], discretize_continuous=True)
# explain a randomly chosen test point (LIME expects a 1-D numpy array)
i = np.random.RandomState(42).randint(0, X_test.shape[0])
exp = explainer.explain_instance(X_test.iloc[i].values, clf_large_dt.predict_proba, num_features=5)
# show the plot
exp.show_in_notebook(show_table=True, show_all=False)

From the LIME output above, it can be observed that this test sample is predicted as 53% '<=50K' and 47% '>50K'. In detail, this is composed of a negative 70% contribution for 'capital-gain' <= 0, positive 18% for 'marital-status_ Married-civ-spouse', negative 10% for 'capital-loss' <= 0, negative 7% for 'hours-per-week' <= 40, and negative 4% for not being 'occupation_ Exec-managerial'.

2.3 Interpret the small decision tree¶

2.3.1 Global interpretation for small decision tree¶

A small decision tree can typically be treated as a globally interpretable model, because we can read its splits directly to understand the relative importance of features in the model.

In [ ]:
Image(filename='decision_tree_small.png')
Out[ ]:

By inspecting the top three split levels of the decision tree shown in this image, we can identify the three most influential features: marital-status_ Married-civ-spouse, capital-gain, and education-num.

We can also use feature_importances_ from scikit-learn to check the exact weights of these features.

In [ ]:
import numpy as np 

importances = clf_small_dt.feature_importances_
indices = np.argsort(importances)[::-1][:3] 

print("Feature ranking:")
for f in range(3):
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))

plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Feature ranking:
1. marital-status_ Married-civ-spouse (0.524028)
2. capital-gain (0.245267)
3. education-num (0.229026)

The three features are marital-status_ Married-civ-spouse (0.524028), capital-gain (0.245267) and education-num (0.229026), the same as in the large decision tree, though with a different ranking.

2.3.2 Local interpretation for small decision tree¶

Given that the small decision tree has a max depth of only 3, we can easily walk through its leaves step by step to interpret the model in terms of these three features.
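That step-by-step walk can also be produced automatically: scikit-learn's `export_text` prints a fitted tree's splits as indented text rules. A minimal sketch on a toy dataset (the notebook's own `clf_small_dt` is not re-fitted here):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a depth-3 tree on a toy dataset
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Print the learned splits as indented text rules, one line per node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

In this notebook, `export_text(clf_small_dt, feature_names=X_train.columns.tolist())` would print the corresponding rules for the small tree.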

LIME is used in this section for local interpretation.

In [ ]:
import lime
import lime.lime_tabular

# reset index
X_train = X_train.reset_index(drop=True)

# generate explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns,
                                                   class_names=[0,1], discretize_continuous=True)
# explain a randomly chosen test point (LIME expects a 1-D numpy array)
i = np.random.RandomState(42).randint(0, X_test.shape[0])
exp = explainer.explain_instance(X_test.iloc[i].values, clf_small_dt.predict_proba, num_features=5)
# show the plot
exp.show_in_notebook(show_table=True, show_all=False)

It can be observed from LIME that this sample is predicted with a 70% chance of being classified as '<=50K' income. This percentage is composed of 67% negative for capital-gain <= 0, 30% positive for Married-civ-spouse, 5% negative for education-num > 9, 2% positive for marital-status not being 'Other', and 1% negative for marital-status not being 'Never-married'.

2.4 Conclusion on interpretation¶

Several conclusions can be drawn from the interpretations above:

  1. The top influential features are the same between the small and large decision trees, but their rankings differ.
  2. Both XGBoost and the small decision tree make predictions based on the presence or absence of certain features.
  3. Some features that are supposed to have equal weight, such as sex: female and race: white, show different or even higher influence on the model, which can lead to errors or bias.

Step 3: Error analysis¶

One-hot encoded features are more likely to generate bias than continuous features such as age, because the model may infer spurious connections between the encoded categories. Therefore, this part will mainly focus on the one-hot encoded features: race, sex, marital status, occupation, and native country.

Define helper functions for feature combinations¶

In [ ]:
races = ['Black', 'White', 'Other']
sexes = ['Female', 'Male']
marital_statuses = ['Divorced', 'Married-civ-spouse', 'Never-married', 'Other']
occupations = ['Adm-clerical', 'Craft-repair', 'Exec-managerial', 'Machine-op-inspct', 
               'Other', 'Other-service', 'Prof-specialty', 'Sales']
countries = ['Other', 'United-States']
In [ ]:
from sklearn.metrics import confusion_matrix

# define a function to plot the misclassification rate for each combination of two features
def plot_combo_error_rate(model_pred, feat1, feat2, feat1_prefix, feat2_prefix):
    misclassification_rates = []
    for feature_1 in feat1:
        for feature_2 in feat2:
            idx = (X_test[feat1_prefix + feature_1] == 1) & (X_test[feat2_prefix + feature_2] == 1)
            y_test_subset = np.array(y_test)[idx]
            y_pred_subset = np.array(model_pred)[idx]
            labels = clf_small_dt.classes_
            cm = confusion_matrix(y_test_subset, y_pred_subset, labels=labels)
            FP = cm[0, 1]
            FN = cm[1, 0]
            if np.sum(cm) == 0:
                misclassification_rate = np.nan
            else:
                misclassification_rate = (FP + FN) / np.sum(cm)
            misclassification_rates.append((feature_1, feature_2, misclassification_rate))

    misclassification_df = pd.DataFrame(misclassification_rates, columns=['feature1', 'feature2', 'Misclassification Rate'])

    fig, ax = plt.subplots(figsize=(16, 2))
    misclassification_df.groupby(['feature1', 'feature2']).mean()['Misclassification Rate'].unstack().plot.bar(ax=ax)
    ax.set_title('Misclassification Rates by '+feat1_prefix[:-2]+' and '+feat2_prefix[:-2])
    ax.set_xlabel(feat1_prefix[:-2])
    ax.set_ylabel('Misclassification Rate')
    plt.xticks(rotation=0)
    plt.show()

def print_false_rates_table(model_pred, feat1, feat2, feat1_prefix, feat2_prefix):
    false_positive_rates = []
    false_negative_rates = []
    for feature_1 in feat1:
        for feature_2 in feat2:
            idx = (X_test[feat1_prefix + feature_1] == 1) & (X_test[feat2_prefix + feature_2] == 1)
            y_test_subset = np.array(y_test)[idx]
            y_pred_subset = np.array(model_pred)[idx]
            labels = clf_small_dt.classes_
            cm = confusion_matrix(y_test_subset, y_pred_subset, labels=labels)
            TN = cm[0, 0]
            FP = cm[0, 1]
            FN = cm[1, 0]
            TP = cm[1, 1]
            if (FP + TN) == 0:
                false_positive_rate = np.nan
            else:
                false_positive_rate = FP / (FP + TN)
            if (FN + TP) == 0:
                false_negative_rate = np.nan
            else:
                false_negative_rate = FN / (FN + TP)
            false_positive_rates.append((feature_1, feature_2, false_positive_rate))
            false_negative_rates.append((feature_1, feature_2, false_negative_rate))

    # transfer to dataframe
    false_positive_df = pd.DataFrame(false_positive_rates, columns=['feature1', 'feature2', 'False Positive Rate'])
    false_negative_df = pd.DataFrame(false_negative_rates, columns=['feature1', 'feature2', 'False Negative Rate'])

    # merge the dataframes
    rates_df = pd.merge(false_positive_df, false_negative_df, on=['feature1', 'feature2'])

    # print in table
    print('{:<20s}{:<20s}{:<20s}{:<20s}'.format(feat1_prefix[:-2], feat2_prefix[:-2], 'False Positive Rate', 'False Negative Rate'))
    for index, row in rates_df.iterrows():
        print('{:<20s}{:<20s}{:<20.6f}{:<20.6f}'.format(row['feature1'], row['feature2'], row['False Positive Rate'], row['False Negative Rate']))

These functions plot a table or diagram of the misclassification rate for any combination of two features.

3.1 Error analysis in XGBoost¶

3.1.1 Bias in sex¶

In [ ]:
plot_combo_error_rate(y_pred_xgb,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,countries, sexes, 'native-country_ ', 'sex_ ')

From the combinations above, it can be clearly observed that predictions for males are more biased than those for females, as males show a higher misclassification rate. After examining the raw data separately, the likely reasons are:

  1. The samples are not equally distributed in the training set, with males accounting for 67%, which is much higher than females' 33%.
  2. Some combinations, such as sex_ Male with occupation_ Sales or sex_ Male with native-country_ Other, are over-weighted as signals for a particular prediction.
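The 67%/33% split mentioned in point 1 can be verified directly with `value_counts(normalize=True)`; a minimal sketch on a toy stand-in for `train_df['sex']`:

```python
import pandas as pd

# Toy stand-in for train_df['sex'] with the stated 67/33 split
sex = pd.Series([' Male'] * 67 + [' Female'] * 33)

# normalize=True returns proportions instead of raw counts
proportions = sex.value_counts(normalize=True)
print(proportions)  # reports 0.67 for ' Male' and 0.33 for ' Female' here
```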

3.1.2 Bias in race¶

In [ ]:
plot_combo_error_rate(y_pred_xgb,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,countries, races, 'native-country_ ', 'race_ ')

From the combinations above, it can be observed that 'race_Other' has the highest misclassification rate, while 'race_White' has a higher misclassification rate than 'race_Black' in most cases. After observing the initial data separately, the reasons could be:

  1. 'race_Other' is a combination of multiple races, which has two properties: a low proportion and a potentially different relationship to the prediction, both of which may raise the misclassification rate.
  2. 'race_White' has a much larger proportion than 'race_Black', indicating a data imbalance issue.

3.1.3 Bias in marital status¶

In [ ]:
plot_combo_error_rate(y_pred_xgb,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')

After examining the misclassification rates of different marital statuses, it is noticeable that the category 'Married-civ-spouse' has a significantly higher misclassification rate than the other categories. This feature is also the most influential in the three models. However, there is no specific combination or data imbalance that can account for this misclassification. Therefore, it is necessary to further examine the false positive rate and false negative rate of the 'Married-civ-spouse' category to better understand the bias.

In [ ]:
print_false_rates_table(y_pred_xgb,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
sex                 marital-status      False Positive Rate False Negative Rate 
Female              Divorced            0.003241            0.684211            
Female              Married-civ-spouse  0.135193            0.260504            
Female              Never-married       0.001272            0.688172            
Female              Other               0.003881            0.688889            
Male                Divorced            0.013369            0.460177            
Male                Married-civ-spouse  0.149246            0.335264            
Male                Never-married       0.003176            0.570470            
Male                Other               0.013263            0.508475            
In [ ]:
print_false_rates_table(y_pred_xgb,races, marital_statuses, 'race_ ', 'marital-status_ ')
race                marital-status      False Positive Rate False Negative Rate 
Black               Divorced            0.004785            0.533333            
Black               Married-civ-spouse  0.103203            0.468966            
Black               Never-married       0.000000            0.600000            
Black               Other               0.000000            0.555556            
White               Divorced            0.007652            0.563830            
White               Married-civ-spouse  0.149986            0.318652            
White               Never-married       0.002828            0.614679            
White               Other               0.006920            0.586207            
Other               Divorced            0.000000            0.600000            
Other               Married-civ-spouse  0.169399            0.360000            
Other               Never-married       0.000000            0.642857            
Other               Other               0.029851            0.625000            
In [ ]:
print_false_rates_table(y_pred_xgb,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
occupation          marital-status      False Positive Rate False Negative Rate 
Adm-clerical        Divorced            0.002755            0.600000            
Adm-clerical        Married-civ-spouse  0.144876            0.469027            
Adm-clerical        Never-married       0.000000            0.866667            
Adm-clerical        Other               0.000000            0.800000            
Craft-repair        Divorced            0.005000            0.600000            
Craft-repair        Married-civ-spouse  0.049102            0.661098            
Craft-repair        Never-married       0.000000            0.700000            
Craft-repair        Other               0.000000            0.700000            
Exec-managerial     Divorced            0.041096            0.447761            
Exec-managerial     Married-civ-spouse  0.501348            0.090446            
Exec-managerial     Never-married       0.007463            0.559322            
Exec-managerial     Other               0.056818            0.482759            
Machine-op-inspct   Divorced            0.000000            0.666667            
Machine-op-inspct   Married-civ-spouse  0.016393            0.696429            
Machine-op-inspct   Never-married       0.000000            0.600000            
Machine-op-inspct   Other               0.000000            0.500000            
Other               Divorced            0.000000            0.736842            
Other               Married-civ-spouse  0.045640            0.613208            
Other               Never-married       0.000734            0.565217            
Other               Other               0.003268            0.692308            
Other-service       Divorced            0.000000            0.555556            
Other-service       Married-civ-spouse  0.003195            0.673077            
Other-service       Never-married       0.000000            0.600000            
Other-service       Other               0.000000            1.000000            
Prof-specialty      Divorced            0.014778            0.584906            
Prof-specialty      Married-civ-spouse  0.611650            0.069612            
Prof-specialty      Never-married       0.011494            0.623656            
Prof-specialty      Other               0.027397            0.500000            
Sales               Divorced            0.000000            0.625000            
Sales               Married-civ-spouse  0.213759            0.315914            
Sales               Never-married       0.003120            0.593750            
Sales               Other               0.000000            0.714286            
In [ ]:
print_false_rates_table(y_pred_xgb,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')
native-country      marital-status      False Positive Rate False Negative Rate 
Other               Divorced            0.000000            0.600000            
Other               Married-civ-spouse  0.104000            0.350746            
Other               Never-married       0.000000            0.785714            
Other               Other               0.006135            0.666667            
United-States       Divorced            0.007555            0.559585            
United-States       Married-civ-spouse  0.153697            0.325066            
United-States       Never-married       0.002560            0.593458            
United-States       Other               0.007092            0.576087            
In [ ]:
print(sum(y_train))
7841

The tables above show that the model has a markedly higher False Negative Rate than False Positive Rate, with most values above 50%. Checking the number of positive labels in the training set (7,841) confirms that the target variable is imbalanced, which is a likely cause of the high False Negative Rate.
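A quick way to quantify this imbalance, and one common mitigation for XGBoost, is sketched below. The positive count comes from the output above; the total of 30162 training rows is an assumption about the cleaned training set, and `scale_pos_weight` is XGBoost's parameter for upweighting the minority class:

```python
# 7841 positive labels (from the print above) out of an assumed 30162 training rows
pos = 7841
total = 30162
neg = total - pos

pos_rate = pos / total
print(f"positive rate: {pos_rate:.3f}")  # roughly one in four labels is positive

# XGBoost can compensate via scale_pos_weight = neg / pos,
# e.g. XGBClassifier(scale_pos_weight=neg / pos)
print(f"scale_pos_weight: {neg / pos:.2f}")
```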

3.1.4 Bias in countries¶

In [ ]:
plot_combo_error_rate(y_pred_xgb,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,occupations, countries, 'occupation_ ', 'native-country_ ')

Although the countries data are not well-balanced, no specific bias is observed in this case.

3.1.5 Bias in occupations¶

In [ ]:
plot_combo_error_rate(y_pred_xgb,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,countries, occupations, 'native-country_ ', 'occupation_ ')

Given that the occupation has the most significant impact on the prediction, no particular bias is observed in this case.

3.2 Error analysis in large decision tree¶

3.2.1 Bias in sex¶

In [ ]:
plot_combo_error_rate(y_pred_large_dt,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,countries, sexes, 'native-country_ ', 'sex_ ')

3.2.2 Bias in race¶

In [ ]:
plot_combo_error_rate(y_pred_large_dt,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,countries, races, 'native-country_ ', 'race_ ')

3.2.3 Bias in marital status¶

In [ ]:
plot_combo_error_rate(y_pred_large_dt,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')

3.2.4 Bias in countries¶

In [ ]:
plot_combo_error_rate(y_pred_large_dt,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, countries, 'occupation_ ', 'native-country_ ')

3.2.5 Bias in occupations¶

In [ ]:
plot_combo_error_rate(y_pred_large_dt,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,countries, occupations, 'native-country_ ', 'occupation_ ')

3.3 Error analysis in small decision tree¶

3.3.1 Bias in sex¶

In [ ]:
plot_combo_error_rate(y_pred_small_dt,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,countries, sexes, 'native-country_ ', 'sex_ ')

3.3.2 Bias in race¶

In [ ]:
plot_combo_error_rate(y_pred_small_dt,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,countries, races, 'native-country_ ', 'race_ ')

3.3.3 Bias in marital status¶

In [ ]:
plot_combo_error_rate(y_pred_small_dt,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')

3.3.4 Bias in countries¶

In [ ]:
plot_combo_error_rate(y_pred_small_dt,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, countries, 'occupation_ ', 'native-country_ ')

3.3.5 Bias in occupations¶

In [ ]:
plot_combo_error_rate(y_pred_small_dt,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,countries, occupations, 'native-country_ ', 'occupation_ ')

3.4 Conclusion¶

Inside the model¶

It can be observed that similar biases appear in all three models, as explained in the error analysis of XGBoost, mainly on the features 'sex_Male', 'race_Other', 'race_White', and 'marital-status_Married-civ-spouse'. These biases may have several causes, such as imbalance in the features, imbalance in the target, or spurious associations the model draws between unrelated features.

To mitigate these biases, some sensible solutions can be considered, such as:

  1. Collecting data in a balanced way to ensure that all features have similar proportions in the dataset.
  2. Using oversampling or undersampling techniques on the collected data, based on the distribution of the features, to balance the dataset. Note that there is often a tradeoff between bias and accuracy: reducing bias can improve the fairness of the model but may come at the expense of overall performance, so this tradeoff should be considered carefully when designing and deploying the model.
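As an illustration of the second point, minority-class oversampling can be sketched with sklearn's resample utility; the arrays below are synthetic stand-ins for the real features and labels:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)  # imbalanced labels: 20 positives, 80 negatives

# Oversample the minority class (with replacement) up to the majority count
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(y_bal.mean())  # 0.5 -- the classes are now balanced
```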

Among models¶

Among the three models, trade-offs between bias and overall accuracy are also observed. The XGBoost model has the highest overall accuracy, but also has the most significant bias. On the other hand, the small decision tree model has the lowest overall accuracy, but also has the least significant bias. It is important to carefully balance accuracy and bias when choosing a model. Overall, the small decision tree is the fairest model and the XGBoost is the least fair.

Step 4: What-if¶

After preprocessing the data, we end up with two types of data in the dataset: continuous data with integer values (e.g. age, education-num) and one-hot encoded data with binary values (e.g. race, occupation). These two types need to be treated separately in the what-if analysis.
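The distinction between the two column types can be made concrete with a toy example; the DataFrame below is illustrative, not the actual dataset:

```python
import pandas as pd

# Toy frame with one continuous column and one categorical column
df = pd.DataFrame({'age': [39, 50, 23], 'race': ['White', 'Black', 'White']})
encoded = pd.get_dummies(df, columns=['race'])

# Columns whose values are only 0/1 are the one-hot features;
# the rest are continuous and can be shifted numerically in a what-if
binary_cols = [c for c in encoded.columns
               if set(encoded[c].unique()) <= {0, 1, True, False}]
print(binary_cols)  # ['race_Black', 'race_White']
```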

4.1 What-if on continuous-value features¶

In [ ]:
# First define a function with input of classifier
import random
import numpy as np

def cont_what_if(clf, feature, value_added, samples=1000):
    # Pick a random subset of test rows and shift one continuous feature
    random_indices = random.sample(range(len(X_test)), samples)
    X_what_if = X_test.iloc[random_indices].copy()
    X_what_if[feature] += value_added
    y_pred = clf.predict(X_what_if)
    y_orig = clf.predict(X_test.iloc[random_indices])
    change = (y_pred != y_orig).mean()
    # Direction of the net change in predicted labels
    if np.mean(y_pred - y_orig) < 0:
        direction = "decreased"
    else:
        direction = "increased"
    print("Proportion of predictions changed after changing {} by {} for the sample: {} {:.2f}%"
          .format(feature, value_added, direction, change * 100))

We can use this function to see how the predictions change after adding or subtracting a certain value from one of these continuous features. The output is the percentage of predictions that changed.

The function randomly picks 1000 samples at a time, so each continuous feature is run three times to check whether it follows the linear pattern as designed.

What-if: Age¶

In [ ]:
cont_what_if(clf_xgb, 'age', 10, 1000)
cont_what_if(clf_xgb, 'age', -5, 1000)
cont_what_if(clf_xgb, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 3.90%
Proportion of predictions changed after changing age by -5 for the sample: decreased 2.80%
Proportion of predictions changed after changing age by 15 for the sample: increased 4.90%
In [ ]:
cont_what_if(clf_small_dt, 'age', 10, 1000)
cont_what_if(clf_small_dt, 'age', -5, 1000)
cont_what_if(clf_small_dt, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 0.10%
Proportion of predictions changed after changing age by -5 for the sample: increased 0.00%
Proportion of predictions changed after changing age by 15 for the sample: increased 0.00%
In [ ]:
cont_what_if(clf_large_dt, 'age', 10, 1000)
cont_what_if(clf_large_dt, 'age', -5, 1000)
cont_what_if(clf_large_dt, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 1.40%
Proportion of predictions changed after changing age by -5 for the sample: decreased 2.20%
Proportion of predictions changed after changing age by 15 for the sample: increased 1.80%

By changing the age value, a roughly linear change in the prediction rate can be observed as expected: predictions shift upwards as age increases and downwards as age decreases. Furthermore, the differing sensitivity across the models confirms the earlier interpretation that age is the second most influential feature in XGBoost, fifth in the large decision tree, and absent from the small decision tree, which is reflected in the largest change rate with clf_xgb and an almost-zero change with clf_small_dt.

Given that the two decision trees do not cover all features, the what-if analysis below is run on clf_xgb only.

What-if: education-num¶

In [ ]:
cont_what_if(clf_xgb, 'education-num', 2, 1000)
cont_what_if(clf_xgb, 'education-num', 3, 1000)
cont_what_if(clf_xgb, 'education-num', -2, 1000)
Proportion of predictions changed after changing education-num by 2 for the sample: increased 6.40%
Proportion of predictions changed after changing education-num by 3 for the sample: increased 10.70%
Proportion of predictions changed after changing education-num by -2 for the sample: decreased 6.70%

What-if: hours-per-week¶

In [ ]:
cont_what_if(clf_xgb, 'hours-per-week', 5, 1000)
cont_what_if(clf_xgb, 'hours-per-week', 10, 1000)
cont_what_if(clf_xgb, 'hours-per-week', -5, 1000)
Proportion of predictions changed after changing hours-per-week by 5 for the sample: increased 2.50%
Proportion of predictions changed after changing hours-per-week by 10 for the sample: increased 2.20%
Proportion of predictions changed after changing hours-per-week by -5 for the sample: decreased 2.00%

It can be observed that the model treats education-num and hours-per-week as linear, as predicted: the prediction increases as these values increase and decreases as they decrease.

What-if: capital-gain¶

In [ ]:
cont_what_if(clf_xgb, 'capital-gain', 2000, 1000)
cont_what_if(clf_xgb, 'capital-gain', 4000, 1000)
cont_what_if(clf_xgb, 'capital-gain', 6000, 1000)
Proportion of predictions changed after changing capital-gain by 2000 for the sample: decreased 13.00%
Proportion of predictions changed after changing capital-gain by 4000 for the sample: decreased 14.80%
Proportion of predictions changed after changing capital-gain by 6000 for the sample: increased 29.20%

What-if: capital-loss¶

In [ ]:
cont_what_if(clf_xgb, 'capital-loss', 2000, 1000)
cont_what_if(clf_xgb, 'capital-loss', 4000, 1000)
cont_what_if(clf_xgb, 'capital-loss', 6000, 1000)
Proportion of predictions changed after changing capital-loss by 2000 for the sample: decreased 12.70%
Proportion of predictions changed after changing capital-loss by 4000 for the sample: increased 18.60%
Proportion of predictions changed after changing capital-loss by 6000 for the sample: increased 20.90%

Given that most capital-gain and capital-loss values are 0, the results suggest that the model treats a small capital-gain or capital-loss as a signal of lower income, while a large capital-gain or capital-loss pushes predictions towards higher income. This may indicate that the model uses the presence of substantial investment gains or losses as a proxy for wealth or financial stability, assuming that individuals with higher wealth or financial stability are more likely to have higher income. However, this assumption may not hold in all cases, and further investigation is needed to fully understand the impact of capital-gain and capital-loss on the model's predictions.

What-if: fnlwgt¶

In [ ]:
cont_what_if(clf_xgb, 'fnlwgt', -2000, 1000)
cont_what_if(clf_xgb, 'fnlwgt', 4000, 1000)
cont_what_if(clf_xgb, 'fnlwgt', 6000, 1000)
Proportion of predictions changed after changing fnlwgt by -2000 for the sample: decreased 0.20%
Proportion of predictions changed after changing fnlwgt by 4000 for the sample: decreased 0.20%
Proportion of predictions changed after changing fnlwgt by 6000 for the sample: increased 0.60%

As expected, fnlwgt has low influence on the model.
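This can be cross-checked with permutation importance. The sketch below uses synthetic data and sklearn's GradientBoostingClassifier as a stand-in for clf_xgb; feature 0 is informative, while the others play the role of fnlwgt-like noise:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # label depends only on feature 0

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # feature 0 dominates; the noise features stay near 0
```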

4.2 What-if on one-hot encoded features¶

In [ ]:
# First define a function with input of classifier
import random
import numpy as np

def bin_what_if(clf, feature_prefix, feature, target_feature, samples=1000):
    # Select rows where the original one-hot feature is set
    X_what_if_before_pred = X_test[X_test[feature_prefix + feature] == 1]
    random_indices = random.sample(range(len(X_what_if_before_pred)), samples)
    X_what_if = X_what_if_before_pred.iloc[random_indices].copy()
    # Flip the one-hot encoding from `feature` to `target_feature`
    X_what_if[feature_prefix + target_feature] = 1
    X_what_if[feature_prefix + feature] = 0
    y_pred = clf.predict(X_what_if)
    y_orig = clf.predict(X_what_if_before_pred.iloc[random_indices])
    change = (y_pred != y_orig).mean()
    # Direction of the net change in predicted labels
    if np.mean(y_pred - y_orig) < 0:
        direction = "decreased"
    else:
        direction = "increased"
    print("Proportion of predictions changed after changing {} with {} for the sample: {} {:.2f}%"
          .format(feature, target_feature, direction, change * 100))

This function randomly selects rows that have a particular one-hot feature set, replaces it with target_feature, and then reports the difference in predictions after the change.

What-if: Sex¶

In [ ]:
bin_what_if(clf_xgb, 'sex_ ', 'Male', 'Female', 1000)
bin_what_if(clf_xgb, 'sex_ ', 'Female', 'Male', 1000)
Proportion of predictions changed after changing Male with Female for the sample: increased 2.90%
Proportion of predictions changed after changing Female with Male for the sample: decreased 1.60%

It can be observed that females are predicted to be slightly more likely to earn '>50K' than males.

What-if: Race¶

In [ ]:
bin_what_if(clf_xgb, 'race_ ', 'Black', 'White', 1000)
bin_what_if(clf_xgb, 'race_ ', 'White', 'Black', 1000)
Proportion of predictions changed after changing Black with White for the sample: increased 0.60%
Proportion of predictions changed after changing White with Black for the sample: decreased 2.50%

It can be observed that race_White individuals are predicted to be slightly more likely to earn '>50K' than race_Black individuals.

What-if: Maritial-status¶

In [ ]:
bin_what_if(clf_xgb, 'marital-status_ ', 'Married-civ-spouse', 'Divorced', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Divorced', 'Married-civ-spouse', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Married-civ-spouse', 'Never-married', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Never-married', 'Married-civ-spouse', 1000)
Proportion of predictions changed after changing Married-civ-spouse with Divorced for the sample: decreased 29.50%
Proportion of predictions changed after changing Divorced with Married-civ-spouse for the sample: increased 24.80%
Proportion of predictions changed after changing Married-civ-spouse with Never-married for the sample: decreased 25.70%
Proportion of predictions changed after changing Never-married with Married-civ-spouse for the sample: increased 13.00%

It can be observed that Married-civ-spouse individuals are predicted to be much more likely to earn '>50K' than those with any other marital status. This is consistent with marital status ranking as the primary feature in the interpretation conducted before.

What-if: Native-country¶

In [ ]:
bin_what_if(clf_xgb, 'native-country_ ', 'United-States', 'Other', 1000)
bin_what_if(clf_xgb, 'native-country_ ', 'Other', 'United-States', 1000)
Proportion of predictions changed after changing United-States with Other for the sample: decreased 2.00%
Proportion of predictions changed after changing Other with United-States for the sample: increased 1.90%

It can be observed that native-country_United-States individuals are predicted to be slightly more likely to earn '>50K' than native-country_Other individuals.

Conclusion¶

In conclusion, the what-if step is a useful technique for detecting bias in machine learning models. By changing the values of certain features, we can observe how the model's predictions are affected, and this can help us identify which features are most influential and which groups of people may be affected by bias. Through what-if analysis, we can also see the extent and direction of bias in the model and make adjustments to mitigate it. Overall, the what-if technique enables us to gain a deeper understanding of the biases in our models and take steps to ensure that they are fair and accurate.

Step 5: insightful documentation for the dataset and your most fair model¶

5.1 Datasheet for dataset¶

Data information¶

  • Name: Adult Census Income
  • Source: https://archive.ics.uci.edu/ml/datasets/Adult
  • Number of instances: 48842 (32561 for training, 16281 for testing)
  • Number of attributes: 15

Attributes¶

  • Age: the age of the individual represented in the record.
  • Workclass: the type of employer or organization that the individual works for.
  • Fnlwgt: a weighting factor that adjusts for bias in the sample data.
  • Education: the highest level of education attained by the individual.
  • Education-num: the highest level of education attained by the individual, as represented by a numeric value.
  • Marital-status: the current marital status of the individual.
  • Occupation: the type of job or occupation held by the individual.
  • Relationship: the relationship of the individual to the household head.
  • Race: the self-reported race of the individual.
  • Sex: the biological sex of the individual.
  • Capital-gain: the amount of capital gains earned by the individual.
  • Capital-loss: the amount of capital losses incurred by the individual.
  • Hours-per-week: the number of hours worked per week by the individual.
  • Native-country: the country of origin of the individual.
  • Income: a binary variable indicating whether the individual's income is greater than $50,000 per year.

Motivation¶

  • The purpose of the dataset was to provide a benchmark dataset for machine learning algorithms to predict income levels based on demographic information.
  • The dataset was created with a specific task in mind, which is to predict income levels based on demographic information.
  • The dataset was created to fill a gap in the availability of benchmark datasets for the task of predicting income levels.
  • The dataset was created by a team of researchers, including Ronny Kohavi and Barry Becker, at Silicon Graphics Inc. on behalf of the UCI Machine Learning Repository.
  • The creation of the dataset was not specifically funded by a grant.

Composition¶

  • The instances in the dataset represent individual adult records from the 1994 U.S. Census. There is only one type of instance in this dataset, which is individual adult records.
  • The dataset contains 48,842 instances in total, including both the training and testing sets.
  • The dataset is a sample of instances from a larger set, which is the 1994 U.S. Census.
  • The sample is not necessarily random, as it was collected from the Census dataset.
  • Each instance consists of demographic information such as age, education level, occupation, and marital status for each person, as well as information about each person's income.
  • There is a binary target variable associated with each instance, which indicates whether an individual's income is greater than $50,000 per year.
  • The raw dataset contains missing values (marked '?'); these rows were dropped during preprocessing, so the cleaned dataset contains no missing information.
  • There are no relationships between individual instances made explicit.
  • The dataset provides a recommended split between training and testing sets, which are already provided in the dataset.
  • There are no known errors, sources of noise, or redundancies in the dataset.
  • The dataset is self-contained and does not rely on external resources.
  • The dataset does not contain any data that might be considered confidential or offensive.
  • The dataset does not identify any subpopulations.
  • It is not possible to identify individuals directly from the dataset.
  • The dataset contains data that reveals race, gender, and potentially other sensitive information.

Collection process¶

  • The data associated with each instance in this dataset was acquired through a survey conducted by the US Census Bureau.
  • The data was reported by survey respondents and was collected directly from them.
  • The specific mechanisms and procedures used to collect the data are not explicitly described in the dataset documentation.
  • The dataset is a sample from a larger set of survey data collected by the US Census Bureau.
  • The sampling strategy used to select individuals for the survey is not described in the dataset documentation.
  • The individuals involved in the data collection process were survey respondents who were compensated for their participation in the survey.
  • The timeframe over which the data was collected is not explicitly described in the dataset documentation.
  • It is unclear whether any ethical review processes were conducted prior to the data collection process.
  • Since the data was acquired through a survey, the individuals in question were notified about the data collection and gave their consent to participate in the survey.
  • The exact language and procedures for obtaining consent are not described in the dataset documentation.
  • It is unclear whether individuals were provided with a mechanism to revoke their consent in the future or for certain uses.
  • It is also unclear whether an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) was conducted.

Preprocessing/cleaning/labeling¶

  • The data was preprocessed and cleaned before being included in the dataset.
  • Missing values were removed from the dataset.
  • The data was labeled with a binary income label indicating whether an individual's income is greater or less than $50,000.
  • The software used to preprocess, clean, and label the data is not described in the dataset documentation.
  • It is unclear whether the "raw" data was saved in addition to the preprocessed/cleaned/labeled data.
  • It is also unclear whether the software used for preprocessing/cleaning/labeling the data is available.

Uses¶

  • The dataset has been used for various classification and prediction tasks related to income.
  • The dataset has been used in several research papers, and there are repositories available that link to these papers and systems.
  • The dataset could potentially be used for other classification or prediction tasks related to demographic or socioeconomic characteristics.
  • The dataset documentation does not mention any specific concerns about the composition of the dataset that might impact future uses.
  • It is important to note that the dataset should not be used for discriminatory purposes or to perpetuate harmful biases or stereotypes.
  • There may be tasks for which the dataset is not appropriate, such as those that require a more granular level of income information or that rely heavily on specific demographic or socioeconomic characteristics. It is up to the dataset consumer to carefully consider the appropriateness of the dataset for a particular task.

Distribution¶

The UCI Adult dataset is available in the UCI Machine Learning Repository, which is a popular repository for machine learning datasets. The repository is managed by the Department of Information and Computer Sciences at the University of California, Irvine. It offers access to a vast collection of machine learning datasets, including the UCI Adult dataset, and is a useful resource for researchers and professionals who want to utilize machine learning and data analysis techniques.

Missing list¶

  • IP licence ownership
  • Maintenance

5.2 Model cards¶

Model Details:

  • Model Developer: Chengkai Dai
  • Model Date: 27/02/2023
  • Model Version: 1.0
  • Model Type: Decision tree
  • Training Algorithms: CART (Classification and Regression Trees) algorithm was used to train the decision tree model
  • Hyperparameter Search Space: 'max_depth': range(1, 4), 'min_samples_split': range(2, 7, 2), 'min_samples_leaf': range(1, 6)
  • Features: Age, workclass, education level, marital status, occupation, relationship, race, sex, capital gain, capital loss, and hours per week were used as input features.
  • Citation Details: https://christophm.github.io/interpretable-ml-book/simple.html
                  https://www.educba.com/decision-tree-hyperparameters/
                  https://pdfs.semanticscholar.org/7b99/2c46800f8913c251259c1d981a7695105c5a.pdf
  • Contact Information: chengkai.dai.22@ucl.ac.uk

Intended Use:

  • Primary Intended Uses: The decision tree model is intended to classify whether an individual has an income of over $50,000 based on their demographic and employment information.
  • Primary Intended Users: Researchers, data analysts, and organizations working on income prediction or related tasks.
  • Out-of-Scope Use Cases: The model is not intended for predicting other financial outcomes or for use in high-stakes decisions such as employment or lending decisions.

Factors:

  • Relevant Factors: Age, workclass, education level, marital status, occupation, relationship, race, sex, capital gain, capital loss, and hours per week.
  • Evaluation Factors: Accuracy, precision, recall, F1 score, and fairness metrics such as equal opportunity or demographic parity.

Metrics:

  • Model Performance Measures: Accuracy, precision, recall, and F1 score
  • Decision Thresholds: Threshold of 0.5 was used for classifying individuals as having income over $50,000
  • Variation Approaches: Cross-validation was used to evaluate the model's performance and handle variation in the dataset
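The 0.5 decision threshold can be applied explicitly to predicted probabilities; a minimal sketch with a made-up probability array:

```python
import numpy as np

# Hypothetical output of clf.predict_proba(X)[:, 1]
proba = np.array([0.2, 0.5, 0.7, 0.49])

# Label as '>50K' (1) when the probability reaches the 0.5 threshold
y_pred = (proba >= 0.5).astype(int)
print(y_pred)  # [0 1 1 0]
```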

Evaluation Data:

  • Datasets: UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult)
  • Motivation: The dataset was used to evaluate the performance of the decision tree model for income prediction.
  • Preprocessing: Rows with missing values were dropped, low-prevalence categories were grouped into 'Other', and categorical variables were one-hot encoded.

Training Data:

  • Datasets: UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult)
  • Motivation: The dataset was used to train the decision tree model for income prediction.
  • Preprocessing: Rows with missing values were dropped, and categorical variables were one-hot encoded.

Quantitative Analyses:

  • Unitary Results: Accuracy: 0.84, precision: 0.76, recall: 0.51, F1 score: 0.61

Ethical Considerations:

  • Description: The decision tree model should not be used to make high-stakes decisions such as employment or lending decisions without additional scrutiny and validation.
  • Mitigation Strategies: The model's performance should be evaluated on fairness metrics such as equal opportunity or demographic parity to ensure it does not unfairly discriminate against certain groups.
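The two fairness metrics mentioned can be computed directly from predictions; a minimal sketch with made-up arrays, where a binary group attribute stands in for e.g. sex:

```python
import numpy as np

# Hypothetical labels, predictions, and a binary sensitive attribute
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Demographic parity difference: gap in positive-prediction rates between groups
dp_diff = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# Equal opportunity difference: gap in true-positive rates between groups
def tpr(g):
    return y_pred[(group == g) & (y_true == 1)].mean()

eo_diff = abs(tpr(0) - tpr(1))
print(dp_diff, eo_diff)  # 0.25 and 1/3 for these made-up numbers
```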

Caveats and Recommendations:

  • Description: The decision tree model's performance may vary depending on the quality and representativeness of the dataset used for training and evaluation. The model should be regularly monitored and updated to ensure it remains relevant and accurate over time. Additional fairness constraints or preprocessing techniques may be applied to address potential bias or discrimination in the model's predictions.